Data Download

# Define global constant
GSE_NUM <- "GSE209552"

GEOmetadb setup

To set up GEOmetadb, follow GEOmetadb Tutorial Chapter 3.

# Install required package if not installed
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (!requireNamespace("GEOmetadb", quietly = TRUE))
  BiocManager::install("GEOmetadb")
if (!requireNamespace("DBI", quietly = TRUE))
  install.packages("DBI")
if (!requireNamespace("RSQLite", quietly = TRUE))
  install.packages("RSQLite")
if (!requireNamespace("dplyr", quietly = TRUE))
  install.packages("dplyr")
if (!requireNamespace("ggplot2", quietly = TRUE))
  install.packages("ggplot2")

# Attach packages
library(GEOmetadb)
## Loading required package: R.utils
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.27.1 (2025-05-02 21:00:05 UTC) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## The following object is masked from 'package:R.methodsS3':
## 
##     throw
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## The following objects are masked from 'package:base':
## 
##     attach, detach, load, save
## R.utils v2.13.0 (2025-02-24 21:20:02 UTC) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
## 
##     timestamp
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, isOpen, nullfile, parse, use, warnings
## Loading required package: RSQLite
library(DBI)
library(RSQLite)

# Download the SQLite database if it does not exist
if (!file.exists('GEOmetadb.sqlite')) getSQLiteFile()
file.info('GEOmetadb.sqlite')
# Connect to database
con <- dbConnect(SQLite(), 'GEOmetadb.sqlite')

Download whole GSE209552

To download the data, follow the Identifier Mapping Tutorial. The helper function defined below is adapted from that tutorial’s helper function to avoid duplicated downloads.

# Define helper function
fetch_geo_supp <- function(gse, destdir = "data") {
  # Calculate the actual path where GEOquery puts the files
  target_path <- file.path(destdir, gse)
  
  # Check if directory exists AND contains files
  if (dir.exists(target_path) && length(list.files(target_path)) > 0) {
    message(paste("Data found at", target_path, "- Skipping download."))
    return(invisible(target_path))
  }
  
  # If missing or empty, proceed with download
  message(paste("Downloading", gse, "to", destdir, "..."))
  dir.create(destdir, showWarnings = FALSE, recursive = TRUE)
  GEOquery::getGEOSuppFiles(GEO = gse, baseDir = destdir, makeDirectory = TRUE)
  
  invisible(target_path)
}

# Download
fetch_geo_supp(gse = GSE_NUM)
## Data found at data/GSE209552 - Skipping download.

Construct snRNA-seq only reference list

This GSE contains both snRNA-seq and bulk RNA-seq samples. The bulk RNA-seq samples must be removed to avoid a significant batch effect caused by the technical differences between the two assays. Although this reduces the number of samples to 17, that is still a sufficient amount.

Referring to the original paper’s methods section, only the snRNA-seq was performed on the Illumina NovaSeq 6000 platform, GPL24676. We can use this clue to exclude all bulk RNA-seq samples.

Since all GSMs of this study are packed into a single GSE209552_RAW.tar file on GEO, we cannot select which platform to include or exclude at the download step. Instead, we construct a list of GPL24676 GSMs for downstream reference by querying GEOmetadb.

# Define desired platform
snRNA_platform <- "GPL24676"

# Construct the query
# We join the 'link' table (gse_gsm) with the 'sample' table (gsm) 
# to check the platform (gpl) column.
sql_filter <- paste0(
  "SELECT t1.gsm ",
  "FROM gse_gsm t1 ",
  "JOIN gsm t2 ON t1.gsm = t2.gsm ",
  "WHERE t1.gse = '", GSE_NUM, "' ",
  "AND t2.gpl = '", snRNA_platform, "'"
)

# Get the list of pure snRNA-seq samples
sn_gsm_list <- dbGetQuery(con, sql_filter)$gsm

# View results
print(paste("Found", length(sn_gsm_list), "snRNA-seq samples."))
## [1] "Found 17 snRNA-seq samples."
sn_gsm_list
##  [1] "GSM6376803" "GSM6376804" "GSM6376805" "GSM6376806" "GSM6376807"
##  [6] "GSM6376811" "GSM6376812" "GSM6376813" "GSM6376814" "GSM6376815"
## [11] "GSM6376816" "GSM6376817" "GSM6376818" "GSM6376819" "GSM6376820"
## [16] "GSM6376821" "GSM6376822"

Assess data quality

# Install Seurat if not
if (!requireNamespace("Seurat", quietly = TRUE))
  install.packages("Seurat")

Reorganize snRNA-seq files

The raw count data is stored in the GSE209552_RAW.tar file and therefore needs to be unpacked using the untar() function.

# Attach packages
library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
## 'SeuratObject' was built under R 4.5.0 but the current version is
## 4.5.2; it is recomended that you reinstall 'SeuratObject' as the ABI
## for R may have changed
## 
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
## 
##     intersect, t
library(data.table)
library(Matrix)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)

# Define data paths
tar_file <- "data/GSE209552/GSE209552_RAW.tar"
extract_dir <- "data/GSE209552/raw_files"

# Define unpack helper function
unpack_geo_tar <- function(tar_path, dest_dir) {
  if (dir.exists(dest_dir) && length(list.files(dest_dir)) > 0) {
    message(paste("📂 Extracted files found at", dest_dir, "- Skipping unpack."))
    return(invisible(dest_dir))
  }
  if (!file.exists(tar_path)) {
    stop(paste("❌ Tar file not found:", tar_path))
  }
  message(paste("📦 Unpacking", basename(tar_path), "..."))
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  untar(tar_path, exdir = dest_dir)
  message("✅ Unpacking complete.")
  return(invisible(dest_dir))
}

# Run the unpacker
unpack_geo_tar(tar_file, extract_dir)
## 📂 Extracted files found at data/GSE209552/raw_files - Skipping unpack.

The unpacked directory is very messy: it contains all the 10x Genomics snRNA-seq files, bulk RNA-seq files, subsetted Transposable Elements files, clustered files, and normalized files. Among those, only the 10x Genomics snRNA-seq files are what we need.

To extract them, we loop over the previously defined sn_gsm_list, exclude any file containing _TE_ (indicating Transposable Elements), and keep only files ending in barcodes, matrix, or features. Lastly, we trim the file names down to barcodes, matrix, and features to meet the naming format expected by Seurat’s Read10X() function.

# Reorganize all snRNA-seq 10x files in to a new directory
dest_dir <- "data/GSE209552/10x_organized"    # Where they should go
dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)

# --- MAIN LOOP ---
message(paste("🚀 Organizing files for", length(sn_gsm_list), "samples..."))
## 🚀 Organizing files for 17 samples...
skipped_count <- 0
moved_count   <- 0

for (gsm in sn_gsm_list) {
  
  # 1. Define the specific destination path for this sample
  sample_subdir <- file.path(dest_dir, gsm)
  
  # 2. CHECK: Does it already exist?
  if (dir.exists(sample_subdir) && length(list.files(sample_subdir)) >= 3) {
    skipped_count <- skipped_count + 1
    next
  }
  
  # 3. Create directory
  dir.create(sample_subdir, showWarnings = FALSE)
  
  # 4. Find source files (Exclude the 'TE' files)
  pattern <- paste0("^", gsm)
  files_to_move <- list.files(extract_dir, pattern = pattern, full.names = TRUE)
  files_to_move <- files_to_move[!grepl("_TE_", files_to_move)]
  
  # Safety Checks
  if (length(files_to_move) == 0) {
    warning(paste("⚠️ Skipping", gsm, "- No source files found."))
    next
  }
  
  # 5. MOVE & RENAME
  # We identify specific files to ensure we rename them correctly for Seurat
  f_matrix   <- files_to_move[grep("matrix", files_to_move, ignore.case = TRUE)]
  f_barcodes <- files_to_move[grep("barcodes", files_to_move, ignore.case = TRUE)]
  f_features <- files_to_move[grep("features|genes", files_to_move, ignore.case = TRUE)]
  
  # Verify we found exactly 1 of each before moving
  if (length(f_matrix) == 1 && length(f_barcodes) == 1 && length(f_features) == 1) {
    
    # Rename Matrix -> matrix.mtx.gz
    file.rename(f_matrix, file.path(sample_subdir, "matrix.mtx.gz"))
    
    # Rename Barcodes -> barcodes.tsv.gz
    file.rename(f_barcodes, file.path(sample_subdir, "barcodes.tsv.gz"))
    
    # Rename Genes/Features -> features.tsv.gz
    file.rename(f_features, file.path(sample_subdir, "features.tsv.gz"))
    
    message(paste("  ✅ Standardized & Moved:", gsm))
    moved_count <- moved_count + 1
    
  } else {
    warning(paste("  ⚠️ Skipping", gsm, "- Could not find unique 10x triplet to standardize."))
  }
}

# --- SUMMARY ---
message("🎉 Organization Complete.")
## 🎉 Organization Complete.
message(paste("  🔹 Samples moved & standardized:", moved_count))
##   🔹 Samples moved & standardized: 0
message(paste("  🔸 Samples skipped (already done):", skipped_count))
##   🔸 Samples skipped (already done): 17

Load snRNA-seq files into R using Seurat

Loading the expression data is done by iteratively calling Seurat’s Read10X() and CreateSeuratObject() functions over all GSMs. Together they read a 10x Genomics-structured directory and create a Seurat object, which is the basic unit for any data manipulation in Seurat.

# Get the list of sample subdirectories
sample_dirs <- list.dirs(dest_dir, recursive = FALSE)

# Define a list storing seurat obj of all GSMs
seurat_list <- list()

message(paste("🚀 Loading", length(sample_dirs), "samples..."))
## 🚀 Loading 17 samples...
for (dir in sample_dirs) {
  # Extract the GSM ID from the folder name (e.g., "GSM6376803")
  gsm_id <- basename(dir)
  
  # A. Read 10x data
  # Because the files were renamed to the 10x standard, Read10X() finds them automatically
  counts <- Read10X(data.dir = dir)
  
  # B. Create Seurat Object
  seurat_list[[gsm_id]] <- CreateSeuratObject(
    counts = counts, 
    project = gsm_id, 
    min.cells = 0, 
    min.features = 0
  )
  
  message(paste("  ✅ Loaded:", gsm_id))
}
##   ✅ Loaded: GSM6376803
##   ✅ Loaded: GSM6376804
##   ✅ Loaded: GSM6376805
##   ✅ Loaded: GSM6376806
##   ✅ Loaded: GSM6376807
##   ✅ Loaded: GSM6376811
##   ✅ Loaded: GSM6376812
##   ✅ Loaded: GSM6376813
##   ✅ Loaded: GSM6376814
##   ✅ Loaded: GSM6376815
##   ✅ Loaded: GSM6376816
##   ✅ Loaded: GSM6376817
##   ✅ Loaded: GSM6376818
##   ✅ Loaded: GSM6376819
##   ✅ Loaded: GSM6376820
##   ✅ Loaded: GSM6376821
##   ✅ Loaded: GSM6376822
length(seurat_list)
## [1] 17

However, the metadata (containing the condition information) needs to be fetched from GEOmetadb, extracted from the gsm table’s characteristics_ch1 column, and then inserted into the corresponding Seurat object’s meta.data slot.

# Load GSMs metadata from GEOmetadb
gsm_list_string <- paste(paste0("'", sn_gsm_list, "'"), collapse = ",")
sql <- paste0(
  "SELECT gsm, title, characteristics_ch1 ",
  "FROM gsm ",
  "WHERE gsm IN (", gsm_list_string, ")"
)
metadata <- dbGetQuery(con, sql)

# Merge the condition metadata to corresponding Seurat obj's metadata
metadata$Condition <- sub(".*condition:\\s*([^;]+).*", "\\1", metadata$characteristics_ch1, ignore.case = TRUE)

# Loop and assign
for (gsm in names(seurat_list)) {
  # Find the condition for this GSM
  val <- metadata$Condition[metadata$gsm == gsm]
  
  # Assign to Seurat object (handling missing values)
  if (length(val) > 0) {
    seurat_list[[gsm]]$Condition <- val
  } else {
    seurat_list[[gsm]]$Condition <- "Unknown"
  }
}

# Final Merge all samples into one seurat obj
combined_seurat <- merge(
 x = seurat_list[[1]], 
 y = seurat_list[-1], 
 add.cell.ids = names(seurat_list), 
 project = "GSE209552"
)
# Verify
table(combined_seurat$Condition)
## 
## control     tbi 
##   10666   13955

Free memory

# Disconnect SQLite connection
dbDisconnect(con)

# Remove objects from the R environment
rm(con)
rm(seurat_list)
rm(counts)
rm(metadata)

# Trigger garbage collection to release unused memory
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  3922891 209.6    7404994  395.5   7165144  382.7
## Vcells 83195536 634.8  239034714 1823.7 239023057 1823.7

Data exploration

Plot the following statistics:

  • Number of unique identifiers (to infer coverage)
  • Total number of identifiers (to infer depth)
  • Number of samples
  • Number of cells
library(ggplot2)
library(patchwork)

theme_set(theme_classic(base_size = 26))

# 1. Coverage (nFeature_RNA)
p_cov <- VlnPlot(
  combined_seurat, 
  features = "nFeature_RNA", 
  group.by = "Condition", 
  pt.size = 0
) + 
  ggtitle("Coverage") + 
  ylab("nFeature_RNA")
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
# 2. Depth (nCount_RNA) with adjusted axis
p_depth <- VlnPlot(
  combined_seurat, 
  features = "nCount_RNA", 
  group.by = "Condition", 
  pt.size = 0
) + 
  ggtitle("Depth") + 
  ylab("nCount_RNA") +
  scale_y_continuous(limits = c(0, 50000)) 
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
## Scale for y is already present.
## Adding another scale for y, which will replace the existing scale.
# Combine them to recreate p_metrics
p_metrics <- p_cov | p_depth

# Number of cells among conditions
meta_data <- combined_seurat@meta.data

# A. Bar plot: Total Number of Cells per Condition
p_cells <- ggplot(meta_data, aes(x = Condition, fill = Condition)) +
  geom_bar(color = "black") +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  labs(title = "Total Cells per Group", y = "Number of Cells", x = NULL) +
  theme(legend.position = "none")

# Number of samples among conditions
p_samples <- meta_data %>%
  group_by(Condition) %>%
  summarize(Sample_Count = n_distinct(orig.ident)) %>%
  ggplot(aes(x = Condition, y = Sample_Count, fill = Condition)) +
  geom_bar(stat = "identity", color = "black") +
  geom_text(aes(label = Sample_Count), vjust = -0.5) +
  theme_classic() +
  labs(title = "Total Samples (GSMs) per Group", y = "Number of Samples (GSMs)", x = NULL) +
  theme(legend.position = "none")

# Merge plots
final_plot <- p_metrics / (p_samples | p_cells)

# Adjust font size
final_plot <- final_plot & theme_classic(base_size = 16)
final_plot
## Warning: Removed 352 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

Figure 1: Overall statistics of dataset GSE209552 grouped by condition (control vs. tbi). The upper-left panel shows the number of unique identifiers; the upper-right panel shows the total number of identifiers, capped at 50000 for the sake of aesthetics; the lower-left panel shows the number of samples (GSMs); the lower-right panel shows the number of cells. The x-axis of all four panels indicates condition.

In general, the control group has higher depth and coverage but fewer samples and cells.

Then plot the statistics below per sample (GSM):

  • Number of unique identifiers (to infer coverage)
  • Total number of identifiers (to infer depth)
  • Number of cells
# Define the vertical line style once to keep it consistent
vertical_line <- geom_vline(xintercept = 5.5, linetype = "dashed", color = "red", linewidth = 1)
# 1. Coverage (nFeature_RNA)
p_cov <- VlnPlot(
  combined_seurat, 
  features = "nFeature_RNA", 
  group.by = "orig.ident", 
  pt.size = 0
) + 
  ggtitle("Coverage") + 
  ylab("nFeature_RNA") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none",
    axis.title.x = element_blank() # Often cleaner to remove x-label if text is rotated
  ) + vertical_line
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
# 2. Depth (nCount_RNA) with adjusted axis
p_depth <- VlnPlot(
  combined_seurat, 
  features = "nCount_RNA", 
  group.by = "orig.ident", 
  pt.size = 0
) + 
  ggtitle("Depth") + 
  ylab("nCount_RNA") +
  scale_y_continuous(limits = c(0, 50000)) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none",
    axis.title.x = element_blank()
  ) + vertical_line
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
## Scale for y is already present.
## Adding another scale for y, which will replace the existing scale.
# A. Bar plot: Total Number of Cells per Sample
meta_data <- combined_seurat@meta.data # Ensure meta_data is defined
p_cells <- ggplot(meta_data, aes(x = orig.ident, fill = orig.ident)) +
  geom_bar(color = "black") +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  labs(title = "Total Cells per Sample", y = "Number of Cells", x = NULL) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  ) +
  coord_fixed(ratio = 0.0005)  +
  scale_y_continuous(limits = c(0, 6000)) + vertical_line

# Merge plots
final_plot <- p_cov / p_depth / p_cells + 
  plot_layout(heights = c(1, 1, 1))

# Adjust font size (and re-apply theme elements if needed by the base theme)
final_plot <- final_plot & theme_classic(base_size = 16) & 
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )

final_plot
## Warning: Removed 352 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

Figure 2: Overall statistics per sample (GSM). The top panel shows the number of unique identifiers; the middle panel shows the total number of identifiers; the bottom panel shows the number of cells. The x-axis of all three panels indicates samples (GSMs); the vertical red dashed line separates control (left of the line) from tbi (right of the line).

In general, sequencing depth and coverage are more consistent within the control group. The tbi group has more variable depth and coverage, with GSM6376814 being the sample with the lowest depth and coverage. The number of cells also varies between samples, from a high of 3708 (GSM6376822) to a low of 358 (GSM6376820). This imbalance in cell number may cause downstream analyses to be dominated by the samples that contain more cells.

Another common quality-control step in scRNA-seq is removing damaged cells. When a damaged cell is sequenced, or a cell is damaged during sequencing, its expression intensity is scaled down because of leaked RNA, which significantly biases downstream clustering, annotation, and differential gene expression analysis. A common way to perform this filtering is by mitochondrial transcript percentage, based on the idea that if a cell has leaked RNA, its mitochondrial transcript percentage will be relatively high, since mitochondria are relatively large and less likely to leak. Empirically, the threshold is set to 5% ~ 10%.

However, in snRNA-seq the mitochondrial transcript percentage is expected to be 0% because the cytoplasm is excluded, and any non-zero mitochondrial transcript percentage indicates incomplete exclusion of the cytoplasm (Amezquita et al, 2020). Thus we may want to choose a stricter threshold for snRNA-seq, such as 1% ~ 2%, to match the wet-lab design while tolerating low-quality cells to a certain extent.
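Under the hood, Seurat’s PercentageFeatureSet() is essentially the per-cell ratio of counts from matching features to total counts. A minimal base-R sketch on a toy matrix (gene names and values below are purely illustrative, not from this dataset):

```r
# Toy count matrix: 4 genes x 3 cells (illustrative values only)
counts <- matrix(
  c(10, 0, 5,
     2, 1, 3,
     5, 4, 2,
     3, 5, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("MT-CO1", "MT-ND1", "GENE1", "GENE2"),
                  paste0("cell", 1:3))
)

# Mitochondrial percentage per cell: MT counts / total counts * 100
mt_rows    <- grep("^MT-", rownames(counts))
percent_mt <- colSums(counts[mt_rows, , drop = FALSE]) / colSums(counts) * 100
percent_mt  # cell1: 60, cell2: 10, cell3: 80

# A 10% cutoff would keep only cells at or below the threshold
keep <- percent_mt <= 10
```

The same grep pattern ("^MT-") drives the real computation on the Seurat object.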

# Use seurat PercentageFeatureSet() function
# Create a new entry in meta.data layer storing mitochondrial transcript percentage
combined_seurat[["percent.mt"]] <- PercentageFeatureSet(combined_seurat, pattern = "^MT-")

# Define a horizontal line at y = 10 marking the empirical mito% threshold
horizontal_line <- geom_hline(yintercept = 10, linetype = "dashed", color = "red", linewidth = 1)

# Percent MT per CONDITION (Control vs TBI)
p_condition <- VlnPlot(combined_seurat, features = c("percent.mt"), group.by = "Condition", pt.size = 0) +
  theme(
    axis.title.x = element_blank(), 
    legend.position = "none" # Hide legend since x-axis already shows the names
  ) + 
  ggtitle("Mitochondrial % by Condition") + 
  horizontal_line + 
  ylab("Mitochondrial Transcript (%)") + 
  annotate("text", x = 0.5, y = 12, label = "mito % = 10", color = "black", hjust = 0, size = 5)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
# Percent MT per GSM
p_sample <- VlnPlot(combined_seurat, features = c("percent.mt"), group.by = "orig.ident", pt.size = 0) +
  theme(
    axis.title.x = element_blank(), 
    # Rotate x-axis labels 45 degrees so they don't overlap
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none" # Hide legend since x-axis already shows the names
  ) + vertical_line + 
  ggtitle("Mitochondrial % by Sample") + 
  horizontal_line + 
  ylab("Mitochondrial Transcript (%)") + 
  annotate("text", x = 0.5, y = 12, label = "mito % = 10", color = "black", hjust = 0, size = 5)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
# 4. Combine them
final_mt_plot <- p_condition / p_sample
final_mt_plot

Figure 3: Mitochondrial transcript percentage grouped by condition and by sample (GSM). The y-axis indicates mitochondrial transcript percentage; the x-axis of the top panel indicates condition (control vs. tbi); the x-axis of the bottom panel indicates GSM samples. The horizontal red dashed line marks the empirical mitochondrial transcript percentage threshold of 10%; the vertical red dashed line in the bottom panel separates control (left of the line) from tbi (right of the line).

However, the mitochondrial transcript percentage is far higher than expected in all samples, and the original paper did not assess it at all. One clue that may explain this unexpectedly high mitochondrial transcript percentage, from their Limitations of the study section, is that the control tissues were post-mortem samples while the tbi tissues were surgically evacuated in rare cases post-impact. Thus the samples’ integrity was not ideal, and they are expected to contain damaged cells.

On the other hand, the mitochondrial transcript percentage aligns with the earlier per-sample depth and coverage results, where the control group tends to be more consistent. One hypothesis is that Traumatic Brain Injury (TBI) is a broad term covering various kinds of brain damage resulting from external force, so the intensity and exact pathology may vary among tbi samples, as reflected in the sequencing depth and coverage. Notably, GSM6376814 has a relatively high mitochondrial transcript percentage, and recall that it is also the sample with the lowest depth and coverage, which might imply technical variance and/or batch effects that could bias downstream analysis.

Mapping to HUGO symbols

Check the current identifier

head(rownames(combined_seurat))
## [1] "MIR1302-2HG" "FAM138A"     "OR4F5"       "AL627309.1"  "AL627309.3" 
## [6] "AL627309.2"

At a glance, these look like HUGO symbols. To verify, we can query the symbols against org.Hs.eg.db.

# Install org.Hs.eg.db
if (!requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  BiocManager::install("org.Hs.eg.db")
}
## 
# Attach org.Hs.eg.db to current session
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: generics
## 
## Attaching package: 'generics'
## The following object is masked from 'package:dplyr':
## 
##     explain
## The following object is masked from 'package:R.utils':
## 
##     evaluate
## The following object is masked from 'package:R.oo':
## 
##     compile
## The following objects are masked from 'package:base':
## 
##     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
##     setequal, union
## 
## Attaching package: 'BiocGenerics'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
##     unsplit, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
## 
##     first, rename
## The following objects are masked from 'package:Matrix':
## 
##     expand, unname
## The following objects are masked from 'package:data.table':
## 
##     first, second
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## 
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
## 
##     collapse, desc, slice
## The following object is masked from 'package:data.table':
## 
##     shift
## The following object is masked from 'package:sp':
## 
##     %over%
## The following object is masked from 'package:R.oo':
## 
##     trim
## 
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:dplyr':
## 
##     select
# Check how many of them exist in the official symbol database
valid_symbols <- mapIds(org.Hs.eg.db, 
                        keys = rownames(combined_seurat), 
                        column = "SYMBOL", 
                        keytype = "SYMBOL", 
                        multiVals = "first")

# Calculate the percentage of matches
sum(!is.na(valid_symbols)) / length(rownames(combined_seurat)) * 100
## [1] 100

A 100% match rate confirms that the identifiers are indeed HUGO symbols.

Cleaning

First we remove GSM6376814, since it has the lowest depth, the lowest coverage, and the highest mitochondrial transcript percentage, making it prone to contaminating downstream analysis.

combined_seurat <- subset(combined_seurat, subset = orig.ident != "GSM6376814")

The Seurat tutorial and Heumos et al, 2023 simply applied hard thresholds on mitochondrial transcript percentage (5% and 8%, respectively) at their quality-control steps. Additionally, referring back to Figure 3, I empirically chose the threshold to be 10%.

combined_seurat <- subset(combined_seurat, subset = percent.mt <= 10)

In the end, 2887/24621 = 11.7% of cells were removed. By comparison, the original paper removed 9511/24621 = 38.6% of cells. I suppose part of the reason they applied such strict quality control is that they were aware of the samples’ low integrity.

# 1. Mito % Plot (After Cleaning)
p_mito_clean <- VlnPlot(
  combined_seurat, 
  features = "percent.mt", 
  group.by = "orig.ident", 
  pt.size = 0
) + 
  vertical_line + 
  ggtitle("Mitochondrial % per Sample (After Cleaning)") + 
  ylab("Mitochondrial Transcript (%)") + 
  scale_y_continuous(limits = c(0, 10)) + # Zoom in since max is now <= 10
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1), 
    legend.position = "none",
    axis.title.x = element_blank()
  ) +
  coord_fixed(ratio = 0.5)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
## Scale for y is already present.
## Adding another scale for y, which will replace the existing scale.
# 2. Total Cells Plot (Your existing code)
meta_data <- combined_seurat@meta.data 
p_cells <- ggplot(meta_data, aes(x = orig.ident, fill = orig.ident)) +
  geom_bar(color = "black") +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  labs(title = "Total Cells per Sample (After Cleaning)", y = "Number of Cells", x = NULL) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  ) +
  coord_fixed(ratio = 0.001) +
  scale_y_continuous(limits = c(0, 4000)) + 
  vertical_line

# 3. Combine them vertically
final_clean_plot <- p_mito_clean / p_cells
final_clean_plot
## Warning: Removed 414 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

Figure 4: Per-sample (GSM) statistics after cleaning. The top panel shows the mitochondrial transcript percentage; the bottom panel shows the number of cells. The x-axis indicates GSMs; the vertical red dashed line separates control (left of the line) from tbi (right of the line).

Normalization

Unlike the bulk RNA-seq normalization methods we covered in class, sc/snRNA-seq data are zero-inflated: most genes have zero counts in most cells. Thus two of the methods we covered, TMM and RLE, are not suitable for sc/snRNA-seq, because both assume that most genes are not differentially expressed, and zero-inflated count data yield inflated fold changes whenever an algorithm tries to compute them (Lytal et al, 2020).
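To see concretely how zero inflation breaks these methods, consider RLE-style (median-of-ratios) size factors on a toy matrix: once every gene has a zero in at least one cell, all gene-wise geometric means collapse to zero and the size factors become undefined. A minimal base-R sketch with made-up counts:

```r
# Toy zero-inflated matrix: 5 genes x 4 cells; every gene has at least one zero
counts <- matrix(
  c(0, 8, 3, 0,
    5, 0, 0, 2,
    0, 0, 7, 1,
    4, 6, 0, 0,
    0, 3, 0, 9),
  nrow = 5, byrow = TRUE
)

# RLE reference: per-gene geometric mean across cells
# log(0) = -Inf, so every geometric mean collapses to 0
geo_mean <- exp(rowMeans(log(counts)))

# Size factor = per-cell median of counts / reference
# Dividing by 0 gives Inf, and 0/0 gives NaN, so every per-cell median is NA
size_factors <- apply(counts / geo_mean, 2, median)
```

In practice DESeq-style implementations work around this by computing the median over genes with non-zero geometric means, but with zero-inflated single-cell data few or no such genes remain.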

Library-size-based log-normalization & filtering lowly expressed genes

Simple normalization by library size still works as intended, is not affected by inflated zeros, and is widely used in the single-cell transcriptomics field as a baseline (Ge et al, 2025; Lytal et al, 2020; Amezquita et al, 2019). In Seurat, it is wrapped in the NormalizeData() function. Note the default scale.factor is 10000, to account for sc/snRNA-seq’s relatively small number of reads per cell compared with bulk RNA-seq.
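As a sanity check of what LogNormalize computes, the same arithmetic can be reproduced in base R: divide each cell’s counts by that cell’s total, multiply by the scale factor, then apply log1p. The matrix below is made up for illustration; only the formula matches Seurat’s documented behavior.

```r
# Toy count matrix: 3 genes x 2 cells (illustrative values only)
counts <- matrix(c(4, 0,
                   6, 2,
                   0, 8),
                 nrow = 3, byrow = TRUE)

scale_factor <- 1e4  # Seurat's default scale.factor

# Library-size normalization followed by natural log(1 + x),
# matching NormalizeData(normalization.method = "LogNormalize")
norm <- log1p(sweep(counts, 2, colSums(counts), "/") * scale_factor)
```

Here each column (cell) sums to 10, so e.g. the first gene of cell 1 becomes log1p(4 / 10 * 1e4) = log1p(4000).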

For visualization, I first randomly sample one cell from each GSM and plot the pre- and post-normalization counts on the y-axis, with the sampled cells on the x-axis. The reason for not plotting normalized counts per GSM is that in sc/snRNA-seq, counts are normalized by the total reads of a cell rather than of a GSM, so plotting normalized counts per GSM would not be very meaningful.

In both the Seurat tutorial and Heumos et al, 2023, lowly expressed genes are removed based on the minimum number of cells expressing a gene, instead of thresholding on expression level as in edgeR. Thus I will also filter lowly expressed genes based on a minimum number of cells.

library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:S4Vectors':
## 
##     expand
## The following objects are masked from 'package:Matrix':
## 
##     expand, pack, unpack
## The following object is masked from 'package:R.utils':
## 
##     extract
library(tibble)

# Collapse the split sample layers into a single unified 'counts' layer
combined_seurat <- JoinLayers(combined_seurat)

# Create a copy of un-normalized data for test purpose
seurat_copy <- combined_seurat

# Randomly sample exactly ONE cell per GSM (orig.ident)
set.seed(42) 
sampled_cells <- combined_seurat@meta.data %>%
  rownames_to_column("Cell_ID") %>%
  group_by(orig.ident) %>%
  sample_n(1) %>%
  pull(Cell_ID)

# Create a mini-Seurat object of just these cells for faster extraction
sc_unfiltered <- subset(combined_seurat, cells = sampled_cells)

# ==========================================
# PLOT 1: Pre-Normalization (Raw Counts)
# ==========================================
raw_matrix <- as.matrix(GetAssayData(sc_unfiltered, layer = "counts"))

df_raw <- as.data.frame(raw_matrix) %>%
  pivot_longer(cols = everything(), names_to = "Cell_ID", values_to = "Count") %>%
  mutate(Sample = sc_unfiltered@meta.data[Cell_ID, "orig.ident"])

p_pre_norm <- ggplot(df_raw, aes(x = Sample, y = Count, fill = Sample)) +
  geom_violin(scale = "width", color = "black", alpha = 0.8) + 
  ggtitle("1. Pre-Normalization: Raw Counts (Unfiltered)") + 
  ylab("Expression (Linear)") +
  theme_classic() +
  theme(axis.text.x = element_blank(), legend.position = "none", axis.title.x = element_blank()) + 
  vertical_line + 
  coord_cartesian(ylim = c(0, 200))


# ==========================================
# PLOT 2: Post-Normalization (Before Filtering)
# ==========================================
# Normalize the main dataset first
combined_seurat <- NormalizeData(combined_seurat, normalization.method = "LogNormalize", scale.factor = 10000)
## Normalizing layer: counts
# Update our mini-Seurat object to get the new normalized 'data' layer
sc_unfiltered <- subset(combined_seurat, cells = sampled_cells)

norm_matrix_unfilt <- as.matrix(GetAssayData(sc_unfiltered, layer = "data"))

df_norm_unfilt <- as.data.frame(norm_matrix_unfilt) %>%
  pivot_longer(cols = everything(), names_to = "Cell_ID", values_to = "Normalized_Count") %>%
  mutate(Sample = sc_unfiltered@meta.data[Cell_ID, "orig.ident"])

p_post_unfilt <- ggplot(df_norm_unfilt, aes(x = Sample, y = Normalized_Count, fill = Sample)) +
  geom_violin(scale = "width", color = "black", alpha = 0.8) + 
  ggtitle("2. Post-Normalization: Log-Normalized (Unfiltered)") + 
  ylab("Expression (Log1p)") + 
  theme_classic() +
  theme(axis.text.x = element_blank(), legend.position = "none", axis.title.x = element_blank()) + 
  vertical_line + 
  coord_cartesian(ylim = c(0, 6))


# ==========================================
# PLOT 3: Post-Normalization (After Filtering)
# ==========================================
# Calculate frequencies and filter the main dataset
raw_counts_all <- GetAssayData(combined_seurat, layer = "counts")
cells_per_gene <- rowSums(raw_counts_all > 0)
min_cells <- 10
genes_to_keep <- names(which(cells_per_gene >= min_cells))

message("Number of genes BEFORE filtering: ", nrow(raw_counts_all))
## Number of genes BEFORE filtering: 33538
message("Number of genes AFTER filtering (>= ", min_cells, " cells): ", length(genes_to_keep))
## Number of genes AFTER filtering (>= 10 cells): 25389
message("Total genes removed: ", nrow(raw_counts_all) - length(genes_to_keep))
## Total genes removed: 8149
# Subset the main object to remove noisy, low-expressed genes
combined_seurat_filtered <- subset(combined_seurat, features = genes_to_keep)

# Create a new mini-Seurat object based on the FILTERED dataset
sc_filtered <- subset(combined_seurat_filtered, cells = sampled_cells)

norm_matrix_filt <- as.matrix(GetAssayData(sc_filtered, layer = "data"))

df_norm_filt <- as.data.frame(norm_matrix_filt) %>%
  pivot_longer(cols = everything(), names_to = "Cell_ID", values_to = "Normalized_Count") %>%
  filter(Normalized_Count > 0) %>%
  mutate(Sample = sc_filtered@meta.data[Cell_ID, "orig.ident"])

p_post_filt <- ggplot(df_norm_filt, aes(x = Sample, y = Normalized_Count, fill = Sample)) +
  geom_violin(scale = "width", color = "black", alpha = 0.8) + 
  ggtitle(paste0("3. Post-Normalization: Log-Normalized (Filtered >= ", min_cells, " cells)")) + 
  ylab("Expression (Log1p)") + 
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none", axis.title.x = element_blank()) + 
  vertical_line + 
  coord_cartesian(ylim = c(0, 6))

# ==========================================
# COMBINE PLOTS
# ==========================================
# Stack them vertically 
final_norm_plot <- p_pre_norm / p_post_unfilt / p_post_filt
final_norm_plot

Figure 5: Pre- and post-normalization counts per cell (randomly sampling one cell per GSM). The upper panel indicates the raw count distribution of the randomly sampled cells; the middle panel indicates the normalized counts of the randomly sampled cells; the bottom panel indicates the distribution of normalized counts after excluding all genes expressed in fewer than 10 cells. The X-axis indicates the GSM each cell is sampled from; the vertical red dashed line separates control (left side of red line) and tbi (right side of red line).

Before filtering, the raw and normalized counts are both inflated at zero due to the sparse nature of sc/snRNA-seq. After filtering, the normalized counts yield an interesting vase shape: a flat, wide base with a number of beads attached above it. The flat base is likely due to the zero-inflated nature of sc/snRNA-seq; even though I excluded all genes expressed in fewer than 10 cells, most genes still have counts close to zero, following a negative binomial distribution. However, I do not have a well-constructed hypothesis to explain the attached beads; perhaps because the raw input (raw counts) is discrete, the log-normalized counts are also discrete. The relative height of the base reflects the sequencing depth of each cell: when counts close to zero are normalized by the total reads of a cell, the larger the total reads, the smaller the normalized count. Thus the higher the base, the lower the sequencing depth.

Note: the X-axis labels each sampled cell by the GSM it came from; each violin represents a single cell, not the GSM itself. So we cannot infer the sequencing depth of a GSM from Figure 5.

Free memory

# Remove large objects from the R environment
rm(combined_seurat)
rm(df_norm_filt)
rm(df_norm_unfilt)
rm(df_raw)
rm(norm_matrix_filt)
rm(norm_matrix_unfilt)
rm(raw_counts_all)
rm(raw_matrix)
rm(sc_filtered)
rm(sc_unfiltered)
rm(meta_data)

# Deallocate unused memory
gc()
##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells   7123985  380.5   13153052  702.5  13153052  702.5
## Vcells 384023613 2929.9  765902692 5843.4 678936786 5179.9

Z-score Normalization & Dimension Reduction

To avoid the curse of dimensionality, where the distance between every pair of data points becomes essentially the same in high-dimensional data, dimension reduction is another common step to conduct before differential gene analysis.
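A quick self-contained simulation illustrates the effect (relative_contrast is a hypothetical helper, not part of any package):

```r
# As dimension d grows, the gap between the nearest and farthest pair of
# uniformly sampled points shrinks relative to the nearest distance,
# so distance-based comparisons lose discriminating power.
set.seed(42)
relative_contrast <- function(d, n = 100) {
  x <- matrix(runif(n * d), nrow = n)   # n points in [0,1]^d
  dists <- dist(x)                      # all pairwise Euclidean distances
  (max(dists) - min(dists)) / min(dists)
}
sapply(c(2, 20, 2000), relative_contrast)   # contrast drops as d grows
```

This is why PCA (or another reduction) is applied before any distance-based downstream step such as clustering or neighbor graphs.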

First perform Z-score normalization to linearly shift each gene's mean expression across cells to 0 and its standard deviation to 1. This functionality is wrapped in ScaleData() in Seurat. For the sake of memory, only scale the top 2000 genes defined by Seurat's FindVariableFeatures() function.
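Ignoring Seurat's default clipping of extreme values (scale.max), the scaling step is plain per-gene Z-scoring, which base R can sketch directly on a toy matrix (illustration only):

```r
# Toy sketch of per-gene Z-scoring across cells: each row (gene) is
# centered to mean 0 and scaled to sd 1, as ScaleData() does by default.
set.seed(1)
expr <- matrix(rnorm(5 * 4, mean = 3), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("cell", 1:4)))

z <- t(scale(t(expr)))   # scale() works on columns, so transpose twice
rowMeans(z)              # ~0 for every gene
apply(z, 1, sd)          # 1 for every gene (up to floating point)
```

Because each gene is standardized independently, highly and lowly expressed genes contribute on the same scale to the PCA that follows.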

# Find the top 2000 gene
combined_seurat_filtered <- FindVariableFeatures(
  combined_seurat_filtered, 
  selection.method = "vst", 
  nfeatures = 2000
)
## Finding variable features for layer counts
# Z-score normalization
combined_seurat_filtered <- ScaleData(combined_seurat_filtered)
## Centering and scaling data matrix
# Create a new mini-Seurat object based on the FILTERED dataset
sc_scaled <- subset(combined_seurat_filtered, cells = sampled_cells)

# 2. Extract Post-Scaling Data (Z-scores)
# Use layer = "scale.data" to pull the matrix you just created
scaled_matrix <- as.matrix(GetAssayData(sc_scaled, layer = "scale.data"))

df_scaled <- as.data.frame(scaled_matrix) %>%
  pivot_longer(cols = everything(), names_to = "Cell_ID", values_to = "Scaled_Expression") %>%
  # NOTE: We do NOT filter > 0 here. Negative values are important Z-scores!
  mutate(Sample = sc_scaled@meta.data[Cell_ID, "orig.ident"])

# 3. Plot the Scaled Distribution
p_scaled <- ggplot(df_scaled, aes(x = Sample, y = Scaled_Expression, fill = Sample)) +
  geom_violin(scale = "width", color = "black", alpha = 0.8) + 
  ggtitle("Post-Scaling: Z-scored Expression (1 Cell per GSM)") + 
  ylab("Expression (Z-score)") + 
  theme_classic() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1), 
    legend.position = "none", 
    axis.title.x = element_blank()
  ) + vertical_line

p_scaled

Figure 6: Expression level after Z-score normalization. The Y-axis is the Z-score normalized expression level; the X-axis is each single cell sampled from a GSM; the vertical red dashed line separates control (left side of red line) and tbi (right side of red line).

Then perform Principal Component Analysis (PCA), a linear dimension reduction, and plot the corresponding MDS plot using the first two principal components. The PCA is conducted via Seurat's RunPCA() function. The default parameters crashed R on my machine, so for the sake of performance, only run PCA on the top 2000 variable genes and the top 20 PCs.

# Run PCA
combined_seurat_filtered <- RunPCA(combined_seurat_filtered, 
                                   features = VariableFeatures(object = combined_seurat_filtered), 
                                   npcs = 20, 
                                   verbose = TRUE)
## PC_ 1 
## Positive:  LDB2, NELL2, ZNF385B, UNC5D, CDH18, OLFM3, LINGO2, FAM19A2, ST6GALNAC5, CNTNAP5 
##     ASIC2, KCNB2, KCTD16, LINC01250, FSTL4, SV2B, MLIP, CNTN5, KCNH7, SLC35F3 
##     MIR137HG, PTPRT, KCNC2, AC019211.1, LY86-AS1, CADPS2, SLC22A10, DGKB, NRG1, GRIK2 
## Negative:  NEAT1, SLC1A3, FBXL7, ST18, NFIA, ITPKB, PLP1, LPAR1, BCAS1, CDH20 
##     ENPP2, CLMN, BCL6, DAAM2, TMEM144, CNDP1, SPP1, CTNNA3, KCNH8, GLUL 
##     ABCA8, AC026316.5, CDH19, HHIP, ARHGAP24, MOBP, SLC9A9, LINC01608, FOLH1, GRAMD2B 
## PC_ 2 
## Positive:  SLC1A2, ADGRV1, GPC5, RFX4, PRDM16, TPD52L1, RYR3, MRVI1, BMPR1B, PITPNC1 
##     FAM189A2, AQP4, SLC7A11, EFEMP1, FAM107A, CACHD1, GNA14, RNF219-AS1, EYA2, SLC4A4 
##     ATP1A2, ZNF98, MGST1, STON2, RANBP3L, PARD3, GLIS3, LINC00299, SPON1, GJA1 
## Negative:  ST18, CTNNA3, PLP1, SPP1, ENPP2, CNDP1, LPAR1, ZNF536, SPOCK3, BCAS1 
##     AC026316.5, MOBP, ABCA8, HHIP, PDE1C, TMEM144, APBB1IP, DOCK8, KCNH8, LINC01608 
##     FOLH1, RASGRP3, PIEZO2, PALM2, DOCK2, CDH19, PCSK6, FYB1, INPP5D, SLCO1A2 
## PC_ 3 
## Positive:  ERBB4, NXPH1, ZNF536, ZNF385D, GAD1, GRIK1, SPOCK3, GAD2, ADARB2, GRIP1 
##     KIAA1211, BTBD11, MYO16, DLX6-AS1, SGCZ, SLC24A3, PLD5, TENM1, ASTN2, DNER 
##     ANK1, VWC2, SLC35F4, PCDH15, KCNAB1, ALK, GRIP2, SLC6A1, TOX3, TENM3 
## Negative:  IPCEF1, DOCK8, APBB1IP, DOCK2, INPP5D, FYB1, C3, AC074327.1, SLCO2B1, RUNX1 
##     LYN, LRMDA, PALD1, ADAM28, LRRK1, RBM47, MYO1F, CSF1R, PTPRC, IKZF1 
##     PIK3AP1, TBXAS1, FLI1, SYK, ST6GAL1, RHBDF2, ARHGAP25, WDFY4, ATP8B4, SLC11A1 
## PC_ 4 
## Positive:  PTPRK, ST18, CTNNA3, GRM3, PLP1, ENPP2, CLMN, TMEM144, AC026316.5, CNDP1 
##     MOBP, BCAS1, ABCA8, DAAM2, HHIP, NECAB1, LINC01608, FOLH1, KCNH8, LPAR1 
##     ZNF365, SLCO1A2, PDE1A, SV2B, PDE1C, CDH20, CDH19, SATB2, SLC5A11, FBXL7 
## Negative:  APBB1IP, DOCK8, DOCK2, FYB1, LRMDA, C3, INPP5D, AC074327.1, KCNIP1, ST6GAL1 
##     NXPH1, ZNF385D, RBM47, CSF1R, ADAM28, SLCO2B1, PIK3AP1, IKZF1, SDK1, PTPRC 
##     SYK, PALD1, TBXAS1, MYO1F, FLI1, LYN, ATP8B4, RUNX1, TNS3, GAD1 
## PC_ 5 
## Positive:  AC004852.2, VCAN, MEGF11, LHFPL3, CA10, SEMA5A, AC124254.1, COL9A1, STK32A, SMOC1 
##     PTPRZ1, MIR3681HG, PCDH15, PLPP4, GALR1, BEST3, FERMT1, LUZP2, LINC02588, NTN1 
##     STK32B, PDZD2, CALCRL, LINC02283, AL445250.1, SULF2, CHST9, BRINP3, GPNMB, COL20A1 
## Negative:  ZNF536, KCNC2, SPOCK3, NEAT1, PTCHD4, GRIP1, CNTNAP3, ANK1, GAD2, MYO5B 
##     SAMD5, ELAVL2, AC010127.1, TENM3, AL033504.1, RUNX2, CNTN5, PRKG1, PLCXD3, AL445623.2 
##     GRM3, CLMN, TRPC4, CTNNA3, ST18, CNTNAP3B, ROR1, SLC35F4, ZNF385D, NHS
# visualize MDS plot
MDS_plot <- DimPlot(combined_seurat_filtered, reduction = "pca", group.by = "Condition")
MDS_plot

Figure 7: MDS plot grouped by condition. The Y-axis is the second PC; the X-axis is the first PC; each dot indicates a single cell.

Note that the control and tbi groups are clearly distinguished at the four tips branching out from the lower-left bulk, which implies transcriptional differences in the cells located in this region.

Obtain normalized dataframe

NormalizeData() stores normalized counts in the data layer. Simply call Seurat::GetAssayData() on the data layer to obtain the normalized counts, then convert to a data.frame using as.data.frame().

After filtering low quality cells at the cleaning step, 21734 cells remain. After filtering lowly expressed genes at the normalization step, 25389 genes remain. Thus the shape of the final expression matrix is 25389 * 21734.

# extract normalized data
norm_matrix <- GetAssayData(combined_seurat_filtered, layer = "data")

# Check dimensions (Genes x Cells)
dim(norm_matrix)
## [1] 25389 21734
# Check rowname
head(rownames(norm_matrix))
## [1] "AL627309.1" "AC114498.1" "AL669831.5" "FAM87B"     "LINC00115" 
## [6] "FAM41C"
# Check colname
head(colnames(norm_matrix))
## [1] "GSM6376803_AAACCCAGTTTCGGCG-1" "GSM6376803_AAACGAAAGACCATTC-1"
## [3] "GSM6376803_AAACGAACAGAGGTAC-1" "GSM6376803_AAACGAACATGTGGTT-1"
## [5] "GSM6376803_AAACGAATCGGATACT-1" "GSM6376803_AAACGAATCTGTCCGT-1"
# convert to data.frame
# Commented out to skip due to memory constraints
# The script below prompts to allocate ~5.4 GB of RAM
# norm_df <- as.data.frame(norm_matrix)
# Check dimensions (Genes x Cells)
# dim(norm_df)

The final result is a dgCMatrix instead of a data.frame due to memory constraints. Converting to a data.frame requires an additional ~5.4 GB of RAM.

Differential gene expression